Abstract. Generating descriptions for videos has many applications, including
assisting blind people and human-robot interaction. The recent advances in
image captioning as well as the release of large-scale movie description
datasets such as MPII-MD [28] allow this task to be studied in more depth.
Many of the proposed methods for image captioning rely on pre-trained object
classifier CNNs and Long Short-Term Memory recurrent networks (LSTMs) for
generating descriptions. While image description focuses on objects, we argue
that it is important to distinguish verbs, objects, and places in the
challenging setting of movie description. In this work we show how to learn
robust visual classifiers from the weak annotations of the sentence
descriptions. Based on these visual classifiers we learn how to generate a
description using an LSTM. We explore different design choices to build and
train the LSTM and achieve the best performance to date on the challenging
MPII-MD dataset. We compare and analyze our approach and prior work along
various dimensions to better understand the key challenges of the movie
description task.


1 Introduction 

Automatic description of visual content has lately received a lot of interest in
our community. Multiple works have successfully addressed the image captioning
problem [6,16,17,35]. Many of the proposed methods rely on Long Short-Term
Memory networks (LSTMs) [13]. Meanwhile, two large-scale movie description
datasets have been proposed, namely MPII Movie Description (MPII-MD) [28] and
Montreal Video Annotation Dataset (M-VAD) [31]. Both are based on movies with
associated textual descriptions and allow studying the problem of how to
generate movie descriptions for visually disabled people. Works addressing
these datasets [28,33,39] show that they are indeed challenging in terms of
visual recognition and automatic description. This results in significantly
lower performance than on simpler video datasets (e.g. MSVD [2]), but a
detailed analysis of the difficulties is missing. In this work we address this
by taking a closer look at the performance of existing methods on the movie
description task.

This work contributes a) an approach to build robust visual classifiers which 
distinguish verbs, objects, and places extracted from weak sentence annotations; 
b) based on the visual classifiers we evaluate different design choices to train 
an LSTM for generating descriptions. This outperforms related work on the 
MPII-MD dataset, both using automatic and human evaluation; c) we perform 
a detailed analysis of prior work and our approach to understand the challenges 
of the movie description task. 

2 Related Work

Image captioning. Automatic image description has been studied in the past
[9,19,20,24], however it regained attention just recently. Multiple works have
been proposed [6,8,16,17,23,35,37]. Many of them rely on Recurrent Neural
Networks (RNNs) and in particular on Long Short-Term Memory networks (LSTMs).
New datasets have also been released, Flickr30k [40] and MS COCO Captions [3],
where [3] additionally presents a standardized setup for image captioning
evaluation. There are also attempts to analyze the performance of recent
methods. E.g. [5] compares them with respect to the novelty of generated
descriptions and additionally proposes a nearest neighbor baseline that
improves over recent methods.

Video description. In the past video description has been addressed in
semi-realistic settings [1,18], on a small scale [4,11,30] or in constrained
scenarios like cooking [27,29]. Most works (with a few exceptions, e.g. [27])
study the task of describing a short clip with a single sentence. [6] first
proposed to describe videos using an LSTM, relying on precomputed CRF scores
from [27]. [34] extended this work to extract CNN features from frames which
are max-pooled over time. They show the benefit of pre-training the LSTM
network for image captioning and fine-tuning it to video description. [25]
proposes a framework that consists of a 2-D and/or 3-D CNN and an LSTM trained
jointly with a visual-semantic embedding to ensure better coherence between
video and text. [38] jointly addresses the language generation and
video/language retrieval tasks by learning a joint embedding model for a deep
video model and a compositional semantic language model.

Movie description. Recently two large-scale movie description datasets have been
proposed, MPII Movie Description (MPII-MD) [28] and Montreal Video Annotation
Dataset (M-VAD) [31]. Given that they are based on movies, they cover a much
broader domain than previous video description datasets. Consequently they are
much more varied and challenging with respect to the visual content and the
associated description. They also do not have any additional annotations, as
e.g. TACoS Multi-Level [27], thus one has to rely on the weak annotations of
the sentence descriptions. To handle this challenging scenario [39] proposes
an attention-based model which selects the most relevant temporal segments in
a video, incorporates a 3-D CNN, and generates a sentence using an LSTM.
[33] proposes an encoder-decoder framework, where a single LSTM encodes the
input video frame by frame and decodes it into a sentence, outperforming [39].
Our approach for sentence generation is most similar to [6] and we also rely on
their LSTM implementation based on Caffe [15]. However, we analyze different
aspects and variants of this architecture for movie description. To extract
labels from sentences we rely on the semantic parser of [28], however we treat
the labels differently to handle the weak supervision (see Section 3.1). We
show that this improves over [28] and [33].

Fig. 1: Overview of our approach. We first train the visual classifiers for verbs,
objects and places, using different visual features: DT (dense trajectories [36]),
LSDA (large scale object detector [14]) and PLACES (Places-CNN [41]). Next,
we concatenate the scores from a subset of selected robust classifiers and use
them as input to our LSTM.


3 Approach 

In this section we present our two-step approach to video description. The first
step performs visual recognition, while the second step generates textual
descriptions. For the visual recognition we propose to use visual classifiers
trained according to the labels’ semantics and “visuality”. For the language
generation we rely on an LSTM network which has been successfully used for
image and video description [6,33]. We discuss various design choices for
building and training the LSTM. An overview of our approach is given in
Figure 1.


3.1 Visual Labels for Robust Visual Classifiers 

For training we rely on a parallel corpus of videos and weak sentence annotations. 
As in [28] we parse the sentences to obtain a set of labels (single words or short 
phrases, e.g. look up) to train our visual classifiers. However, in contrast to [28] 
we do not want to keep all of these initial labels as they are noisy, but select 
only visual ones which actually can be robustly recognized. 

Avoiding parser failure. Not all sentences can be parsed successfully, as e.g.
some sentences are incomplete or grammatically incorrect. To avoid losing the
potential labels in these sentences, we match our set of initial labels to the
sentences which the parser failed to process.
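
The matching step can be illustrated with a minimal sketch. The text does not specify how labels are matched to unparsed sentences, so the whole-word surface matching below, the function name match_labels and the example label set are assumptions made for illustration only.

```python
import re


def match_labels(sentence, labels):
    """Return the labels that occur in `sentence` as whole words or phrases."""
    text = sentence.lower()
    found = []
    for label in labels:
        # \b anchors keep "sit" from matching inside "visits".
        pattern = r"\b" + re.escape(label.lower()) + r"\b"
        if re.search(pattern, text):
            found.append(label)
    return found


# Hypothetical label set and an ungrammatical sentence a parser might reject.
labels = ["look up", "sit", "door"]
matched = match_labels("He visits, then look up at the door", labels)
print(matched)
```

Plain surface matching of this kind ignores inflection (e.g. “looks up”), so a real pipeline would probably need lemmatization on top of it.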

Semantic groups. Our labels correspond to different semantic groups. In this
work we consider the three most important groups: verbs (actions), objects and
places, as they are typically visual. One could also consider e.g. groups like
mood or emotions, which are naturally harder for visual recognition. We propose
to treat each label group independently. First, we rely on a different
representation for each semantic group, which is targeted to the specific group.
Namely we use the activity recognition feature Improved Dense Trajectories
(DT) [36] for verbs, large scale object detector responses (LSDA) [14] for
objects and scene classification scores (PLACES) [41] for places. Second, we
train one-vs-all SVM classifiers for each group separately. The intuition
behind this is to discard “wrong negatives” (e.g. using object “bed” as
negative for place “bedroom”).

Fig. 2: (a-c) LSTM architectures, (d) Variants of placing the dropout layer.

Visual labels. Now, how do we select visual labels for our semantic groups? In
order to find the verbs among the labels we rely on the semantic parser of [28].
Next, we look up the list of “places” used in [41] and search for corresponding
words among our labels. We look up the object classes used in [14] and search for
these “objects”, as well as their base forms (e.g. “domestic cat” and “cat”). We
discard all the labels that do not belong to any of our three groups of interest
as we assume that they are likely not visual and thus are difficult to recognize.
Finally, we discard labels which the classifiers could not learn, as these are
likely to be noisy or not visual. For this we require the classifiers to have a
minimum area under the ROC curve (Receiver Operating Characteristic).
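
The final selection step can be sketched as follows. This is an illustrative sketch only: the function names, the toy scores and the pairwise (rank-based) way of computing the area under the ROC curve are assumptions, while the threshold of 0.7 is the value reported in Section 4.1.

```python
def roc_auc(scores, targets):
    """Area under the ROC curve via pairwise ranking of positives vs. negatives."""
    pos = [s for s, t in zip(scores, targets) if t == 1]
    neg = [s for s, t in zip(scores, targets) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count correctly ordered (positive, negative) pairs; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_robust_labels(label_scores, label_targets, min_auc=0.7):
    """Keep only labels whose classifier reaches the minimum ROC AUC."""
    return [label for label in label_scores
            if roc_auc(label_scores[label], label_targets[label]) >= min_auc]


# Toy held-out classifier scores and binary ground truth (hypothetical values).
label_scores = {"walk": [0.9, 0.8, 0.3, 0.1], "think": [0.6, 0.2, 0.7, 0.4]}
label_targets = {"walk": [1, 1, 0, 0], "think": [1, 1, 0, 0]}
selected = select_robust_labels(label_scores, label_targets)
print(selected)
```

In this toy example the “walk” classifier separates positives from negatives perfectly and is kept, while “think” ranks most negatives above the positives and is discarded.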


3.2 LSTM for Sentence Generation 

We rely on the basic LSTM architecture proposed in [6] for video description.
As shown in Figures 1 and 2(a), at each time step, an LSTM generates a word
and receives the visual classifiers (input-vis) as well as the previously
generated word (input-lang) as input. To handle natural words we encode each
word with a one-hot vector according to its index in a dictionary and a lower
dimensional embedding. The embedding is jointly learned during training of the
LSTM. [6] compares three variants: (a) an encoder-decoder architecture, (b) a
decoder architecture with visual max predictions, and (c) a decoder architecture
with visual probabilistic predictions. In this work we rely on variant (c) which
was shown to work best as it can rely on the richest visual input. We analyze
the following aspects for this architecture:

Layer structure: We compare a 1-layer architecture with a 2-layer architecture.
In the 2-layer architecture, the output of the first layer is used as input for
the second layer (Figure 2b); this architecture was used by [6] for video
description. Additionally we compare to a 2-layer factored architecture [6],
where the first layer only gets the language as input and the second gets the
output of the first layer as well as the visual input.

Dropout placement: To learn a more robust network which is less likely to overfit
we rely on dropout [12]. Using dropout, a ratio r of randomly selected units is
set to 0 during training (while all others are multiplied with 1/(1 − r)). We
explore different ways to place dropout in the network, i.e. either for language
input (lang-drop) or visual (vis-drop) input only, for both inputs (concat-drop)
or for the LSTM output (lstm-drop), see Figure 2(d). While the default dropout
ratio is r = 0.5, we evaluate the effect of different ratios.
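
As an illustration of the mechanism, the following is a minimal sketch of dropout on a plain list of activations, using the standard "inverted" scaling of 1/(1 − r) for the surviving units (which equals 2 for the default r = 0.5). The function name and signature are hypothetical and this is not the Caffe layer used in our experiments.

```python
import random


def dropout(units, r=0.5, training=True, rng=random):
    """Zero each unit with probability r and rescale the survivors."""
    if not 0.0 <= r < 1.0:
        raise ValueError("dropout ratio must be in [0, 1)")
    if not training:
        # At test time the activations pass through unchanged.
        return list(units)
    scale = 1.0 / (1.0 - r)  # keeps the expected activation unchanged
    return [0.0 if rng.random() < r else u * scale for u in units]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
dropped = dropout([1.0] * 1000, r=0.5, rng=rng)
print(sum(dropped) / len(dropped))  # close to 1.0 on average
```

The placement variants (lang-drop, vis-drop, concat-drop, lstm-drop) differ only in which vector this operation is applied to.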

Learning strategy: By default we rely on a step-based learning strategy, where
the learning rate is halved after a certain number of steps. We find the best
learning rate and step size on the validation set. Additionally we compare this
to a polynomial learning strategy, where the learning rate is continuously
decreased. The polynomial learning strategy has been shown to give good results
faster without tweaking the step size for GoogLeNet implemented by Sergio
Guadarrama in Caffe [15].
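
The two schedules can be written down as a short sketch. The formulas follow what we understand to be the "step" and "poly" learning rate policies in Caffe; the function names are hypothetical, and gamma = 0.5 encodes the halving described above.

```python
def step_lr(base_lr, iteration, step_size, gamma=0.5):
    """Step policy: multiply the rate by gamma after every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


def poly_lr(base_lr, iteration, max_iter, power):
    """Polynomial policy: decay continuously and reach zero at max_iter."""
    return base_lr * (1.0 - iteration / max_iter) ** power


# Compare both schedules for a base learning rate of 0.01.
for it in (0, 4000, 8000, 20000):
    print(it, step_lr(0.01, it, step_size=4000),
          poly_lr(0.01, it, max_iter=25000, power=0.5))
```

With the step policy the rate stays constant within each block of step_size iterations, whereas the polynomial policy changes it at every iteration and depends on the chosen maximum number of iterations.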

4 Evaluation 

In this section we first analyze our approach on the MPII-MD [28] dataset and 
explore different design choices. Then, we compare our best system to prior work. 

4.1 Analysis of our approach 

Experimental setup. We build on the labels discovered by our semantic parser
[28] and additionally match these labels to sentences which the parser failed to
process. To be able to learn classifiers we select the labels that appear at least
30 times, resulting in 1,263 labels. The parser additionally tells us whether the
label is a verb. We use the visual features (DT, LSDA, PLACES) provided with
the MPII-MD dataset [28]. The LSTM output/hidden unit and the memory cell each
have 500 dimensions. We train the SVM classifiers on the training set
(56,861 clips). We evaluate our method on the validation set (4,930 clips) using
the METEOR [21] score, which, according to [7,32], is superior to other popular
measures, such as BLEU [26] and ROUGE [22], in terms of agreement with human
judgments. The authors of CIDEr [32] showed that METEOR also outperforms
CIDEr when the number of references is small, and in the case of MPII-MD we
typically have only a single reference.


Robust visual classifiers. In a first set of experiments we analyze our proposal
to consider groups of labels to learn different classifiers and also to use
different visual representations for these groups (see Section 3.1). In Table 1
we evaluate our generated sentences using different input features to the LSTM.
In our baseline, in the top part of Table 1, we treat all labels equally, i.e.
we use the same visual descriptors for all labels. The PLACES feature is best
with 7.10 METEOR. Combination by stacking all features (DT + LSDA + PLACES)
improves further to 7.24 METEOR.

                                                          Classifiers
Approach                                        Labels  Retrieved  Trained

Baseline: all labels treated the same way
(1) DT                                           1263       -       6.73
(2) LSDA                                         1263       -       7.07
(3) PLACES                                       1263       -       7.10
(4) DT+LSDA+PLACES                               1263       -       7.24

Visual labels
(5) Verbs(DT), Others(LSDA)                      1328     7.08      7.27
(6) Verbs(DT), Places(PLACES), Others(LSDA)      1328     7.09      7.39
(7) Verbs(DT), Places(PLACES), Objects(LSDA)      913     7.10      7.48
(8) + restriction to labels with ROC ≥ 0.7        263     7.41      7.54

Baseline: all labels treated the same way, labels from (8)
(9) DT+LSDA+PLACES                                263     7.16      7.20

Table 1: Comparison of different choices of labels and visual classifiers. All results
reported on the validation set of MPII-MD.

The second part of the table demonstrates the effect of introducing different
semantic label groups. We first split the labels into “Verbs” and all remaining.
Given that some labels appear in both roles, the total number of labels increases
to 1328. We analyze two settings of training the classifiers. In the case of
“Retrieved” we retrieve the classifier scores from the general classifiers
trained in the previous step. “Trained” corresponds to training the SVMs
specifically for each label type (e.g. for “verbs”). Next, we further divide
the non-verbal labels into “Places” and “Others”, and finally into “Places” and
“Objects”. We discard the unused labels and end up with 913 labels. Out of
these labels, we select the labels where the classifier obtains an area under
the ROC curve of at least 0.7 (threshold selected on the validation set). After
this we obtain 263 labels and the best performance in the “Trained” setting. To
support our intuition about the importance of the label discrimination (i.e.
using different features for different semantic groups of labels), we propose
another baseline (last line in the table). Here we use the same set of 263
labels but provide the same feature for all of them, namely the best performing
combination DT + LSDA + PLACES. As we see, this results in an inferior
performance.

We make several observations from Table 1 which lead to robust visual
classifiers from the weak sentence annotations. a) It is beneficial to select
features based on the label semantics. b) Training one-vs-all SVMs for specific
label groups consistently improves the performance as it avoids “wrong”
negatives. c) Focusing on more “visual” labels helps: we reduce the LSTM input
dimensionality to 263 while improving the performance.

(a) LSTM architectures.

Architecture       METEOR
1 layer            7.54
2 layers unfact.   7.54
2 layers fact.     7.41

(b) Dropout strategies.

Dropout       METEOR
no dropout    7.19
lang-drop     7.13
vis-drop      7.34
concat-drop   7.29
lstm-drop     7.54

(c) Dropout ratios.

Dropout ratio   METEOR
r=0.1           7.22
r=0.25          7.42
r=0.5           7.54
r=0.75          7.46

Table 2: (a) Different LSTM architectures, used lstm-dropout 0.5. (b) Comparison
of different dropout strategies in 1-layer LSTM with dropout value=0.5.
(c) Comparison of different dropout ratios in 1-layer LSTM with lstm-dropout.
Labels and classifiers from Table 1 (8). On validation set of MPII-MD.


(a) Base learning rates

Approach               METEOR
lr=0.005, step=2000    7.30
lr=0.01, step=2000     7.54
lr=0.02, step=2000     7.51
lr=0.005, step=4000    7.49
lr=0.01, step=4000     7.59
lr=0.02, step=4000     7.28
lr=0.005, step=6000    7.40
lr=0.01, step=6000     7.40
lr=0.02, step=6000     7.32

(b) Learning strategies with lr=0.01.

Approach                         METEOR
step=2000, iter=25,000           7.54
step=4000, iter=25,000           7.59
step=6000, iter=25,000           7.40
step=8000, iter=25,000           7.32
poly, pow=0.5, maxiter=25,000    7.36
poly, pow=0.5, maxiter=10,000    7.45
poly, pow=0.7, maxiter=25,000    7.43
poly, pow=0.7, maxiter=10,000    7.43

Table 3: (a) Comparison of different base learning rates, network trained for
25,000 iterations. (b) Comparison of different learning strategies with base
lr=0.01. All results reported on the validation set of MPII-MD.


LSTM architectures. Now, as described in Section 3.2, we look at different
LSTM architectures and training configurations. In the following we use the best
performing “Visual Labels” approach, Table 1 (8).

We start with examining the architecture, where we explore different
configurations of LSTM and dropout layers. Table 2a shows the performance of
three different networks: “1 layer”, “2 layers unfactored” and “2 layers factored”
introduced in Section 3.2. As we see, the “1 layer” and “2 layers unfactored”
perform equally well, while “2 layers factored” is inferior to them. In the
following experiments we use the simplest “1 layer” network. We then compare
different dropout placements as illustrated in Figure 2d. We obtain the best
result when applying dropout after the LSTM layer (“lstm-drop”), while having
no dropout or applying it only to language leads to stronger over-fitting to
the visual features. Putting dropout after the LSTM (and prior to a final
prediction layer) makes the entire system more robust. As for the best dropout
ratio, we find that 0.5 works best with lstm-dropout (Table 2c).

We compare different learning rates and learning strategies in Tables 3a and
3b. We find that the best learning rate in the step-based learning is 0.01, while
step 4000 slightly improves over step 2000 (which we used in Table 1). We explore
an alternative learning strategy, namely decreasing the learning rate according
to a polynomial decay. We experiment with different exponents (0.5 and 0.7) and
numbers of iterations (25K and 10K), using the base learning rate 0.01. Our
results show that the step-based learning is superior to the polynomial learning.

Approach                                  METEOR
1 net: lr 0.01, step 2000, iter=25,000    7.54
ensemble of 3 nets                        7.52
1 net: lr 0.01, step 4000, iter=25,000    7.59
ensemble of 3 nets                        7.68
1 net: lr 0.01, step 4000, iter=15,000    7.55
ensemble of 3 nets                        7.72

Table 4: Ensembles of networks with different random initializations. All results
reported on the validation set of MPII-MD.

In most experiments we trained our networks for 25,000 iterations. After
looking at the METEOR performance for intermediate iterations we found that
for the step size 4000 at iteration 15,000 we achieve the best performance
overall. Additionally we train multiple LSTMs with different random orderings
of the training data. In our experiments we combine three in an ensemble,
averaging the resulting word predictions. In most cases the ensemble improves
over the single networks in terms of METEOR score (see Table 4).
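
The averaging of word predictions can be sketched as follows. The sketch assumes that each network outputs a probability distribution over the same vocabulary at a given time step and that the next word is then chosen greedily; the function name, the toy vocabulary and the probabilities are hypothetical.

```python
def ensemble_next_word(distributions, vocabulary):
    """Average per-network word probabilities and pick the most likely word."""
    n = len(distributions)
    avg = [sum(d[i] for d in distributions) / n for i in range(len(vocabulary))]
    best = max(range(len(vocabulary)), key=avg.__getitem__)
    return vocabulary[best], avg


vocabulary = ["someone", "door", "walks"]
# Word probabilities from three networks at one time step (made-up numbers).
nets = [
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.5, 0.2],
]
word, avg = ensemble_next_word(nets, vocabulary)
print(word, avg)
```

Here the first network alone would pick “someone”, but the averaged distribution favours “door”, which is the kind of disagreement an ensemble smooths out.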

To summarize, the most important aspects that decrease over-fitting and 
lead to a better sentence generation are: (a) a correct learning rate and step 
size, (b) dropout after the LSTM layer, (c) choosing the training iteration based 
on METEOR score as opposed to only looking at the LSTM accuracy/loss which 
can be misleading, and (d) building ensembles of multiple networks with different 
random initializations. In the following section we evaluate our best ensemble 
(last line of Table 4) on the test set of MPII-MD. 


4.2 Comparison to related work 

Experimental setup. We compare the best method of [28], the recently proposed
method S2VT [33] and our proposed “Visual Labels”-LSTM on the test set of
the MPII-MD dataset (6,578 clips). We report all popular automatic evaluation
measures, CIDEr [32], BLEU [26], ROUGE [22] and METEOR [21], computed
using the evaluation code of [3]. We also perform a human evaluation, by
randomly selecting 1300 video snippets and asking AMT turkers to rank three
systems (the best SMT of [28], S2VT [33] and ours) with respect to Correctness,
Grammar and Relevance, similar to [28].

Results. Table 5 summarizes the results on the test set of MPII-MD. While we 
rely on identical features and similar labels as [28], we significantly improve the 
performance in all automatic measures, specifically by 1.44 METEOR points. 

                    Automatic score                     Human evaluation: rank
Approach            CIDEr   BLEU@4  ROUGE-L  METEOR     Correct.  Grammar  Relev.
Best SMT of [28]     8.14    0.47    13.21    5.59        2.11     2.39     2.08
S2VT [33]            9.00    0.49    15.32    6.27        2.02     1.67     2.06
Our                  9.98    0.80    16.02    7.03        1.87     1.94     1.86
NN Upperbound      169.64    9.42    44.04   19.43         -        -        -

Table 5: Comparison of prior work and our proposed method using all popular
evaluation measures. Human scores in form of ranking from 1 to 3, where lower
is better. All results reported on the test set of MPII-MD.

Moreover, we improve over the recent approach of [33], which also uses an LSTM
to generate video descriptions. Exploring different strategies for label
selection and classifier training, as well as various LSTM configurations,
allows us to obtain the best result to date on the MPII-MD dataset. Human
evaluation mainly agrees with the automatic measures. We outperform both prior
works in terms of Correctness and Relevance, however we lose to S2VT in terms
of Grammar. This is due to the fact that S2VT produces overall shorter (7.4
versus 8.7 words per sentence) and simpler sentences, while our system generates
longer sentences and therefore has higher chances to make mistakes.

We also propose a retrieval upper bound (last line in Table 5). For every test
sentence we retrieve the closest training sentence according to the METEOR
score. The rather low METEOR score of 19.43 reflects the difficulty of the
dataset.

A closer look at the sentences produced by all three methods gives us
additional insights. An interesting characteristic is the output vocabulary
size, which is 94 for [28], 86 for [33] and 605 for our method, while the test
set contains 6422 unique words. This clearly shows a higher diversity of our
output. Among the words generated by our system and absent in the outputs of
others are such verbs as grab, drive, sip, climb, follow, objects as suit,
chair, cigarette, mirror, bottle and places as kitchen, corridor, restaurant.
We showcase some qualitative results in Figure 3. Here, e.g. the verb pour,
object drink and place courtyard only appear in our output. We attribute this,
on one hand, to our diverse and robust visual classifiers. On the other hand,
the architecture and parameter choices of our LSTM allow us to learn a better
correspondence between words and visual classifiers’ scores.


5 Analysis 

Despite the recent advances in the video description domain, including our
proposed approach, the video description performance on the movie description
datasets (MPII-MD [28] and M-VAD [31]) remains relatively low. In this section
we want to take a closer look at three methods, best SMT of [28], S2VT [33] and
ours, in order to understand where these methods succeed and where they fail.
In the following we evaluate all three methods on the MPII-MD test set.

Approach    Sentence

SMT [28]    Someone is a man, someone is a man.
S2VT [33]   Someone looks at him, someone turns to someone.
Our         Someone is standing in the crowd, a little man with a little smile.
Reference   Someone, back in elf guise, is trying to calm the kids.

SMT [28]    The car is a water of the water.
S2VT [33]   On the door, opens the door opens.
Our         The fellowship are in the courtyard.
Reference   They cross the quadrangle below and run along the cloister.

SMT [28]    Someone is down the door, someone is a back of the door, and
            someone is a door.
S2VT [33]   Someone shakes his head and looks at someone.
Our         Someone takes a drink and pours it into the water.
Reference   Someone grabs a vodka bottle standing open on the counter and
            liberally pours some on the hand.

Fig. 3: Qualitative comparison of prior work and our proposed method. Examples
from the test set of MPII-MD. Our approach identifies activities, objects, and
places better than related work.



Fig. 4: METEOR score per sentence. (a) Sentence length: test set sorted by
sentence length (increasing). (b) Word frequency: test set sorted by word
frequency (decreasing). Shown values are smoothed with a mean filter of size 500.


5.1 Difficulty versus performance 

As a first study we sort the reference sentences (from the test set) by
difficulty, where difficulty is defined in multiple ways.

Sentence length and Word frequency. Two of the simplest sentence
difficulty measures are the sentence length and the average frequency of its
words. When sorting the data by difficulty (increasing sentence length or
decreasing average word frequency), we find that all three methods have the
same tendency to obtain a lower METEOR score as the difficulty increases
(Figures 4a and 4b). For the word frequency the correlation is stronger. Our
method consistently outperforms the other two, most notably as the difficulty
increases.

Fig. 5: METEOR score per sentence. (a) Textual difficulty: test set sorted by
Textual NN score (decreasing). (b) Visual difficulty: test set sorted by Visual
kNN score, k = 10 (decreasing). Shown values are smoothed with a mean filter of
size 500.
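
The curves in Figures 4 and 5 are obtained by sorting the test sentences by a difficulty measure and smoothing the per-sentence METEOR scores with a mean filter. A minimal sketch of this procedure is given below; the function name is hypothetical, and averaging only over full windows is an assumption about how the borders are handled.

```python
def smoothed_curve(difficulty, meteor, window=500, descending=False):
    """Sort per-sentence scores by difficulty and apply a sliding mean filter."""
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i],
                   reverse=descending)
    ordered = [meteor[i] for i in order]
    if not ordered:
        return []
    # Prefix sums make every window average an O(1) operation.
    prefix = [0.0]
    for value in ordered:
        prefix.append(prefix[-1] + value)
    w = min(window, len(ordered))
    return [(prefix[i + w] - prefix[i]) / w
            for i in range(len(ordered) - w + 1)]


# Toy example: three sentences with lengths 3, 1, 2 and their METEOR scores.
curve = smoothed_curve([3, 1, 2], [30.0, 10.0, 20.0], window=2)
print(curve)
```

For the word frequency plot the same function would be called with descending=True, so that the most frequent (easiest) sentences come first.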

Textual and Visual Nearest Neighbors. Next, for each reference test
sentence we search for the closest training sentence (in terms of the METEOR
score). We use the obtained best scores to sort the reference sentences by textual
difficulty, i.e. the “easy” sentences are more likely to be retrieved. If we consider
all training sentences, we obtain a Textual Nearest Neighbor. We sort the test
sentences according to these scores (decreasing) and plot the performance of
three methods in Figure 5a. All methods “agree” and ours is best throughout
the difficulty range, in particular in the more challenging part of the plot. We
can also use visual features to find the k Nearest Neighbors in the training set,
select the best one (in terms of the METEOR score) and use this score to sort
the reference sentences. We call this a Visual k Nearest Neighbor. The intuition
behind it is to consider a video clip as visually “easy” if the most similar training
clips also have similar descriptions (the “difficult” clip might have no close visual
neighbours). We rely on our best visual representation (8) from Table 1 and the
cosine similarity measure to define the Visual kNN and sort the reference
sentences according to it with k = 10. We see a clear correlation between the
visual difficulty and the performance of all methods (Figure 5b).
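
A minimal sketch of the Visual kNN score is given below. The text scores sentence pairs with METEOR; since METEOR is not reimplemented here, the sketch takes the sentence scorer as a parameter and uses a simple word-overlap (Jaccard) measure as a clearly labeled stand-in. All function names and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def word_overlap(candidate, reference):
    """Stand-in for METEOR (assumption): Jaccard overlap of the word sets."""
    a, b = set(candidate.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def visual_knn_score(test_feat, test_sentence, train_feats, train_sentences,
                     sentence_score=word_overlap, k=10):
    """Best sentence score among the k visually closest training clips."""
    ranked = sorted(range(len(train_feats)),
                    key=lambda i: cosine(test_feat, train_feats[i]),
                    reverse=True)
    return max(sentence_score(train_sentences[i], test_sentence)
               for i in ranked[:k])


# Toy training set with 2-D "classifier score" vectors (made-up values).
train_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
train_sentences = ["someone opens the door", "the sun shines",
                   "someone walks away"]
score = visual_knn_score([1.0, 0.05], "someone opens a door",
                         train_feats, train_sentences, k=2)
print(score)
```

A high score means that a visually similar training clip carries a similar description, i.e. the test clip is visually “easy”; the training set is assumed to be non-empty.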

Summary. a) All methods perform better on shorter, common sentences
and our method notably wins on longer sentences. b) Our method also wins
on sentences that are more difficult to retrieve. c) Visual difficulty, defined by
cosine similarity and representation (8) from Table 1, strongly correlates with
the performance of all methods. d) When comparing all four plots (Figures 4a
and 4b, Figures 5a and 5b), we find that the strongest correlation between the
methods’ performance and the difficulty is observed for the Textual difficulty,
while the weakest correlation is observed for the Sentence length.

Fig. 6: Average METEOR score for WordNet verb topics. Selected sentences with
single verb, number of sentences in brackets.

Topic          Entropy  Top-1    Top-2      Top-3     Top-4         Top-5
motion         7.05     turn     walk       shake     move          go
contact        7.10     open     sit        stand     hold          pull
perception     4.83     look     stare      see       watch         gaze
stative        4.84     be       follow     stop      go            wait
change         6.92     reveal   start      emerge    fill          make
communication  6.73     look up  nod        face      speak         talk
body           5.04     smile    wear       dress     grin          glare
social         6.11     watch    join       do        close         make
cognition      5.21     look at  see        read      take          leave
possession     5.29     give     take       have      stand in      find
none           5.04     throw    hold       fly       lie           rush
creation       5.69     hit      make       do        walk through  come i
competition    5.19     drive    walk over  point     play          face
consumption    4.52     use      drink      eat       take          sip
emotion        6.19     draw     startle    feel      touch         enjoy
weather        3.93     shine    blaze      light up  drench        blow

Table 6: Entropy and top 5 frequent verbs of each WordNet topic in the MPII-MD.


5.2 Semantic analysis 

WordNet Verb Topics. We analyze the test sentences more closely with respect
to different verbs. For this we rely on WordNet topics (high level entries in
the WordNet ontology, e.g. “motion”, “perception”, “competition”, “emotion”),
defined for most synsets in WordNet [10]. We obtain the sense information from
the semantic parser of [28], thus senses might be noisy. We showcase the 5 most
frequent verbs for each topic in Table 6. We select sentences with a single verb,
group them according to the verb topic and compute an average METEOR
score for each topic, see Figure 6. We find that our method is best for all topics
except “communication”, where [28] wins. The most frequent verbs in this topic
are “look up” and “nod”, which are also frequent in the dataset and in the
sentences produced by [28]. The best performing topic, “cognition”, is highly
biased towards the verb “look at”. The most frequent topics, “motion” and
“contact”, which are also visual (e.g. “turn”, “walk”, “open”, “sit”), are
nevertheless quite challenging, which we attribute to their high diversity (see
their entropy w.r.t. different verbs and their frequencies in Table 6). At the
same time “perception” is far less diverse and mainly focuses on verbs like
“look” or “stare”, which are quite frequent in the dataset, resulting in better
performance. Topics with more abstract verbs (e.g. “be”, “have”, “start”) tend
to get lower scores.
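
The diversity of a topic can be quantified as the Shannon entropy of its verb frequency distribution. The sketch below illustrates this computation; the function name is hypothetical, and the use of the base-2 logarithm is an assumption, since the base behind the values in Table 6 is not stated.

```python
import math
from collections import Counter


def verb_entropy(verbs):
    """Shannon entropy (in bits) of the verb frequency distribution."""
    counts = Counter(verbs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A diverse topic (four equally frequent verbs) versus a skewed one.
diverse = verb_entropy(["turn", "walk", "shake", "move"])
skewed = verb_entropy(["look"] * 7 + ["stare"])
print(diverse, skewed)
```

A topic dominated by a single verb yields a low entropy, while a topic whose occurrences are spread over many verbs yields a high one, which matches the contrast between “perception” and “motion” discussed above.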

Top 100 best and worst sentences. We look at the 100 test sentences where
our method obtains the highest and the lowest METEOR scores. Out of the 100
best sentences 44 contain the verb “look” (including verb phrases such as
“look at”). The other frequent verbs are “walk”, “turn”, “smile”, “nod”,
“shake”, “stare”, “sit”, i.e. mainly visual verbs. Overall the sentences are
simple and common. Among the 100 lowest scoring sentences we observe more
diversity: 12 sentences contain no verb, 10 mention unusual words (specific to
the movie), 24 contain no subject, 29 have a non-human subject. Altogether this
leads to a lower performance, in particular, as most training sentences contain
“Someone” as subject and generated sentences are biased towards it.

Summary. a) The test sentences that mention the verb “look” (and similar)
get higher METEOR scores due to their high frequency in the dataset. b) The
sentences with more “visual” verbs tend to get higher scores. c) The sentences
without verbs (e.g. describing a scene), without subjects or with non-human
subjects get lower scores, which can be explained by a dataset bias towards
“Someone” as subject.

6 Conclusion 

We propose an approach to automatic movie description which trains visual
classifiers and uses the classifier scores as input to an LSTM. To handle the
weak sentence annotations we rely on three main ingredients. First, we
distinguish three semantic groups of labels (verbs, objects and places), second
we train them discriminatively, removing potentially noisy negatives, and
third, we select only a small number of the most reliable classifiers. For
sentence generation we show the benefits of exploring different LSTM
architectures and learning configurations. As a result we obtain the highest
performance on the MPII-MD dataset as shown by all automatic evaluation
measures and extensive human evaluation.

We analyze the challenges in the movie description task using our and two
prior works. We find that the factors which contribute to higher performance
include: presence of frequent words, sentence length and simplicity as well as
presence of “visual” verbs (e.g. “nod”, “walk”, “sit”, “smile”). Textual and
visual difficulties of sentences/clips strongly correlate with the performance
of all methods. We observe a high bias in the data towards humans as subjects
and verbs similar to “look”. Future work has to focus on dealing with less
frequent words and handling less visual descriptions. This potentially requires
considering external text corpora, modalities other than video, such as audio
and dialog, and looking across multiple sentences. This would allow exploiting
long- and short-range context and thus understanding and describing the story
of the movie.